Explore and Summarize Red Wine Quality Data Set


by Rawan Alaufi

Introduction

In this project I will explore and summarize the red wine data set using the R programming language to identify which features correlate most strongly with wine quality.


Brief Description of Attributes


1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
12 - quality (score between 0 and 10)

Univariate Plots Section

## [1] 1599   12
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

Our dataset consists of 12 variables and 1599 observations, with no missing values.
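The shape and missing-value claims above can be checked with a few base R calls. This is only a sketch. The CSV file name is not given in this report, so the example uses the first ten values of three columns as printed in the str() output (the name `wine_sample` is made up for illustration). What it prints describes that sample, not the full dataset.

```r
# First ten values of three columns, copied from the str() output above.
# Assumption: with the full data, this data frame would come from read.csv().
wine_sample <- data.frame(
  pH      = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

sample_dim <- dim(wine_sample)                     # number of rows and columns
print(sample_dim)
str(wine_sample)                                   # type of each variable

missing_per_column <- colSums(is.na(wine_sample))  # NA count per column
print(missing_per_column)
any_missing <- anyNA(wine_sample)                  # TRUE if any value is NA
print(any_missing)
```

Run on the full data frame, the same calls would show the dimensions and whether any column contains missing values.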


The most common quality grade is 5.
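A frequency table is a quick way to confirm which grade is most common. The sketch below uses only the first ten quality scores shown in the str() output, so the counts are illustrative and are not the counts for all 1599 wines.

```r
# First ten quality scores, copied from the str() output above (sample only).
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

quality_counts <- table(quality)            # number of wines per grade
print(quality_counts)

quality_share <- prop.table(quality_counts) # same counts as proportions
print(round(quality_share, 2))

# Grade with the highest count (returned as a character label).
most_common <- names(quality_counts)[which.max(quality_counts)]
print(most_common)
```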

Alcohol content mostly ranges from 9% to 14%; a few wines are above 14% and a few are below 9%.
As the above plot shows, the pH of the wines is mostly between 3 and 4, and most wines have a pH between 3 and 3.5.
Minimum and maximum pH:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010
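The statement that most wines have a pH between 3 and 3.5 can be quantified as a proportion. The sketch below does this for the first ten pH values printed in the str() output, so the resulting share applies to that sample only.

```r
# First ten pH values, copied from the str() output above (sample only).
ph <- c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)

ph_range <- range(ph)              # minimum and maximum
print(ph_range)

in_band <- ph > 3 & ph < 3.5       # TRUE where pH lies strictly between 3 and 3.5
share_in_band <- mean(in_band)     # the mean of a logical vector is a proportion
print(share_in_band)
```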


The above plot shows the sulphates content of the wines; most wines contain sulphates between 0.5 and 1.
The above plot shows that wine density is approximately normally distributed.


As we see in the above plot, most wines in this dataset have low volatile acidity.
The fixed acidity plot is right skewed with some outliers, which pulls the mean (8.32) above the median (7.9).

Univariate Analysis

What is the structure of your dataset?

There are 1599 wines in the dataset with 12 features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol and quality). The variable quality is stored as an integer and can be treated as an ordered factor with the following levels.

(worst) -> (best)
quality: score between 0 and 10; this dataset contains only scores from 3 to 8.
Other observations:
The median quality score is 6.
Most wines have a pH close to 3.3 (the median is 3.31).
The median chlorides value is 0.079; the minimum is 0.012 and the maximum is 0.611.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the data set is the quality grade of the wine. I'd like to determine which features are best for predicting the quality of a wine. I suspect alcohol and some combination of the other variables can be used to build a predictive model of the quality grade.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think alcohol, pH and residual sugar will help.
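A linear regression is one simple way to try the predictive idea mentioned above. This report does not fit such a model, so the sketch below is purely illustrative. It treats quality as a numeric response, which is an assumption, and it uses only the first ten rows printed in the str() output, which is far too few for meaningful coefficients.

```r
# First ten rows of four columns, copied from the str() output above.
# Assumption: quality is treated as numeric; the report itself fits no model.
wine_sample <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  pH             = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol        = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality        = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Ordinary least squares with the three candidate predictors.
fit <- lm(quality ~ alcohol + pH + residual.sugar, data = wine_sample)

model_coefs <- coef(fit)            # intercept plus one slope per predictor
print(round(model_coefs, 3))

predicted <- predict(fit, newdata = wine_sample)  # fitted quality per row
print(round(predicted, 2))
```

On the full data frame, summary(fit) would also report how much of the variation in quality the predictors explain.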

Did you create any new variables from existing variables in the dataset?


No

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data?

No

Bivariate Plots Section

From the above chart, free.sulfur.dioxide, residual.sugar and pH do not seem to have strong correlations with quality. The variable most correlated with quality is alcohol, and density is the variable most correlated with alcohol. I want to look closer at plots involving quality and some other variables, such as alcohol and density.
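The ranking read off the correlation chart can also be produced as numbers. The sketch below computes Pearson correlations with cor() and sorts the quality column. It runs on the first ten rows printed in the str() output, so the coefficients it prints will differ from those of the full dataset.

```r
# First ten rows of five columns, copied from the str() output above (sample only).
wine_sample <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

cor_matrix <- cor(wine_sample)      # Pearson correlation is the default method

# Correlation of every variable with quality, largest first.
cor_with_quality <- sort(cor_matrix[, "quality"], decreasing = TRUE)
print(round(cor_with_quality, 2))
```

The first entry is quality correlated with itself, so the next entries are the ones of interest.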


The box plots for each quality score show that the median sulphates level increases as wine quality increases.
The above plot represents the relationship between quality and alcohol: higher-quality wines tend to have higher alcohol content.
The above scatter plot shows a negative correlation: density decreases as alcohol increases.
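The box-plot comparisons above can be backed by per-grade medians. The sketch below uses aggregate() on the first ten rows printed in the str() output. With so few rows the medians are illustrative and need not follow the trend seen in the full data.

```r
# First ten rows of three columns, copied from the str() output above (sample only).
wine_sample <- data.frame(
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol   = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality   = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Median sulphates and alcohol for each quality grade (one row per grade).
medians_by_quality <- aggregate(
  cbind(sulphates, alcohol) ~ quality,
  data = wine_sample,
  FUN  = median
)
print(medians_by_quality)
```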
The correlation matrix chart shows that the correlation coefficient of pH and fixed.acidity is -0.68, which means there is a fairly strong negative correlation between them. The above scatter plot shows this correlation: pH increases as fixed.acidity decreases.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Quality correlates positively with alcohol and sulphates, and alcohol correlates negatively with density.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Yes, there is a clear negative relationship between alcohol and density.

What was the strongest relationship you found?

Among the features of interest, quality correlates most strongly with alcohol, with a correlation coefficient of about 0.5, and density correlates negatively with alcohol, with a coefficient of about -0.5. The largest coefficient noted in this section is -0.68, between pH and fixed.acidity.

Multivariate Plots Section


The dark blue points, which represent the worst quality, fall in the region of high density and low alcohol.
As the plot shows, higher-quality wines have low volatile acidity and high alcohol.


The above plot shows a strong correlation between free.sulfur.dioxide and total.sulfur.dioxide, plotted by alcohol percentage: free.sulfur.dioxide increases as total.sulfur.dioxide increases.
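One way to see whether this relationship holds across alcohol levels is to compute the correlation overall and again within alcohol groups. The sketch below splits wines at the median alcohol value. That split is an assumption made for illustration and is not necessarily how the plot groups the points. It uses the first ten rows printed in the str() output, so the values apply to that sample only.

```r
# First ten rows of three columns, copied from the str() output above (sample only).
wine_sample <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102),
  alcohol              = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Correlation across all rows.
overall_cor <- cor(wine_sample$free.sulfur.dioxide,
                   wine_sample$total.sulfur.dioxide)
print(round(overall_cor, 2))

# Assumption: two alcohol bands, split at the sample median.
wine_sample$alcohol_band <- ifelse(
  wine_sample$alcohol > median(wine_sample$alcohol), "higher", "lower"
)

# Same correlation computed separately inside each band.
cor_by_band <- sapply(
  split(wine_sample, wine_sample$alcohol_band),
  function(band) cor(band$free.sulfur.dioxide, band$total.sulfur.dioxide)
)
print(round(cor_by_band, 2))
```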


Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

There is a strong relationship between quality, alcohol and density.

Were there any interesting or surprising interactions between features?


No

Final Plots and Summary

Plot One


Most wines have a quality score of 5 or 6.

Plot Two


The above heat map shows that the best quality scores come with a high alcohol percentage.

Plot Three


The worst quality scores come with high density and a low alcohol percentage.


Reflection

This is my first R project; I had not used the R programming language before. I chose the red wine data set, which contains 1599 observations and 12 features. It took me more than 10 hours to explore. It was not difficult, and I enjoyed it. When I have learned more R, I will come back to this data set and explore it further.